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Abstract 

Future  CMPs  will  have  more  cores  and  greater  on- 
chip  cache  capacity.  The  on-chip  cache  can  either  be 
divided  into  separate  private  L2  caches  for  each  core, 
or  treated  as  a  large  shared  L2  cache.  Private  caches 
provide  low  hit  latency  but  low  capacity,  while  shared 
caches  have  higher  hit  latencies  but  greater  capacity. 
Victim  replication  was  previously  introduced  as  a  way 
of  reducing  the  average  hit  latency  of  a  shared  cache 
by  allowing  a  processor  to  make  a  replica  of  a  pri- 
mary  cache  victim  in  its  local  slice  of  the  global  L2 
cache.  Although  victim  replication  performs  well  on 
multithreaded  and  single-threaded  codes,  it  performs 
worse  than  the  private  scheme  for  multiprogrammed 
workloads  where  there  is  little  sharing  between  the  dif¬ 
ferent  programs  running  at  the  same  time. 

In  this  paper,  we  propose  victim  migration,  which  im¬ 
proves  on  victim  replication  by  adding  an  additional  set 
of  migration  tags  on  each  node  which  are  used  to  im¬ 
plement  an  exclusive  cache  policy  for  replicas.  When  a 
replica  has  been  created  on  a  remote  node,  it  is  not  also 
cached  on  the  home  node,  but  only  recorded  in  the  mi¬ 
gration  tags.  This  frees  up  space  on  the  home  node  to 
store  shared  global  lines  or  replicas  for  the  local  pro¬ 
cessor. 

We  show  that  victim  migration  performs  better  than 
private,  shared,  and  victim  replication  schemes  across 
a  range  of  single  threaded,  multithreaded,  and  multi¬ 
programmed  workloads,  while  using  less  area  than  a 
private  cache  design.  Victim  migration  provides  a  re¬ 
duction  in  average  memory  access  latency  of  up  to  10% 
over  victim  replication. 

1.  Introduction 

Future  chip-scale  multiprocessors  (CMPs)  are  likely 
to  continue  to  increase  both  the  number  of  cores  and 
the  total  cache  capacity  on  chip.  Off-chip  misses  will 
remain  expensive,  but  increases  in  clock  frequency  to¬ 
gether  with  worsening  relative  wire  delays  will  also  in¬ 


crease  latencies  for  cross-chip  communication,  reaching 
tens  of  clock  cycles  in  future  technologies  [1,  12].  Effec¬ 
tive  use  of  on-chip  cache  must  therefore  consider  both 
the  cost  of  off-chip  misses  and  the  cost  of  cross-chip 
communications. 

Two  baseline  outer-level  cache  designs,  private  and 
shared,  illustrate  the  trade-offs  between  these  two  com¬ 
ponents  of  effective  memory  access  latency.  A  private 
design  dedicates  a  slice  of  the  on-chip  L2  cache  stor¬ 
age  as  a  private  L2  cache  for  each  processor  core.  The 
shared  design  aggregates  all  the  L2  cache  capacity  to 
form  a  single  L2  cache  shared  by  all  the  cores. 

The  private  design  has  low  L2  hit  latency,  as  the  pri¬ 
vate  L2  is  physically  co-located  with  the  processor  core 
and  has  much  smaller  area  than  a  shared  cache.  This 
provides  good  performance  provided  the  working  set  fits 
within  the  local  L2  slice.  The  disadvantage  of  the  pri¬ 
vate  L2  scheme  is  that  effective  on-chip  cache  capac¬ 
ity  is  reduced  for  shared  data,  as  each  core  must  retain 
its  own  copy  of  any  shared  data  block.  Also,  the  fixed 
partitioning  of  resources  does  not  allow  a  thread  with  a 
larger  working  set  to  “borrow”  L2  cache  capacity  from 
the  private  caches  of  other  processors  hosting  threads 
with  smaller  working  sets. 

The  shared  design  reduces  the  off-chip  miss  rate  for 
large  shared  working  sets,  as  only  a  single  on-chip  copy 
is  required  for  any  shared  data.  However,  large  shared 
L2  caches  have  worse  access  latency  than  a  small  pri¬ 
vate  L2  cache  and  can  suffer  from  inter-thread  cache 
conflicts. 

Many  studies  have  shown  that  either  private  or  shared 
caches  can  considerably  outperform  the  other,  depend¬ 
ing  on  the  specific  characteristics  of  the  workload  [27, 
15,  8,  15],  This  observation  has  motivated  several  pro¬ 
posals  to  develop  hybrid  architectures  that  retain  the  ad¬ 
vantages  of  both  private  and  shared  designs. 

A  number  of  proposals  seek  to  reduce  the  effective 
access  latency  of  a  large  shared  cache  by  adopting  a 
non-uniform  cache  access  (NUCA)  architecture.  Cur¬ 
rent  shared  L2  caches  [4,  18,  25]  are  constructed  using 
a  “dancehall”  configuration,  as  shown  in  the  left  of  Fig- 


ure  1,  where  processors  with  private  LI  caches  are  on 
one  side  of  an  interconnect  crossbar  and  a  banked  shared 
L2  cache  is  on  the  other.  To  reduce  access  latency  and 
energy  consumption,  large  caches  are  physically  divided 
into  many  small  banks  or  slices.  Current  designs  use  a 
fixed  worst-case  latency  to  access  any  slice,  but  as  cross¬ 
chip  latencies  grow,  this  will  result  in  unacceptable  hit 
times.  NUCA  [17]  designs  allow  access  latency  to 
vary  depending  on  the  relative  placement  of  the  proces¬ 
sor  and  L2  slice  containing  the  data.  Dynamic  NUCA 
schemes  have  been  proposed  for  uniprocessors  [17,  7], 
where  frequently-accessed  cache  blocks  gradually  mi¬ 
grate  closer  to  the  processor.  These  schemes  are  con¬ 
siderably  more  complicated  when  applied  to  multipro¬ 
cessors  with  dancehall  configurations  [5,  8,  15],  These 
schemes  require  some  form  of  duplicated  L2  tag  kept  lo¬ 
cal  to  each  processor  to  reduce  the  number  of  slices  that 
must  be  searched  to  locate  an  on-chip  block.  Further,  all 
such  local  tags  must  be  kept  consistent  with  any  block 
migration  triggered  by  a  remote  processor,  imposing  ad¬ 
ditional  serialization  constraints  on  otherwise  indepen¬ 
dent  cache  accesses  [5,  8,  15], 

An  alternative  base  structure  is  a  tiled  CMP  as  shown 
on  the  right  side  of  Figure  1,  where  the  CMP  is  orga¬ 
nized  as  an  array  of  replicated  tiles  connected  over  a 
switched  network.  Each  tile  contains  a  processor  with 
LI  caches,  a  slice  of  the  L2  cache,  and  a  connection  to 
the  on-chip  network.  Tiled  CMPs  scale  well  to  larger 
processor  counts  and  can  easily  support  families  of  prod¬ 
ucts  with  varying  numbers  of  tiles,  including  the  op¬ 
tion  of  connecting  multiple  separately  tested  and  speed- 
binned  die  within  a  single  package.  A  tiled  CMP  is  a 
natural  structure  for  private  L2  caches,  but  can  also  be 
used  to  implement  a  shared  L2  cache  [27].  Recent  work 
has  shown  how  a  victim  replication  fVR)  scheme  can  re¬ 
duce  the  effective  hit  latency  of  a  tiled  shared  L2  cache 
by  placing  a  copy  of  an  evicted  LI  cache  block  in  the  lo¬ 
cal  slice  of  the  L2  instead  of  sending  it  back  to  the  home 
tile  [27],  If  the  processor  requests  the  same  block  again, 
it  can  now  access  the  local  copy  instead  of  the  remote 
original.  This  approach  provides  many  of  the  same  ben¬ 
efits  as  a  dynamic  NUCA  scheme,  but  with  much  less 
complexity. 

This  earlier  study  shows  that  victim  replication  has 
better  overall  performance  than  either  private  or  shared 
schemes  for  multi-threaded  and  single-threaded  work¬ 
loads  [27].  However,  as  we  show  in  the  evaluation 
section,  the  victim  replication  scheme  does  not  per¬ 
form  as  well  as  private  caches  when  running  a  multi- 
programmed  workload,  where  each  processor  is  running 
an  independent  program.  The  problem  in  this  case  is  that 
private  data  is  placed  on  chip  twice,  once  at  the  home 
tile  and  once  as  a  replica  at  the  processor  using  the  data. 


This  reduces  effective  on-chip  capacity  and  causes  con¬ 
flicts  between  independent  threads. 

In  this  paper,  we  introduce  victim  migration  (VM), 
which  improves  on  victim  replication  by  providing  ad¬ 
ditional  migration  tags  at  the  home  tile.  These  additional 
tags  are  used  to  implement  an  exclusive  cache  policy  for 
replicas,  where  the  regular  tag  and  data  storage  of  the 
home  tile’s  L2  slice  is  freed  when  a  replica  is  present  on 
a  remote  tile.  The  newly  vacated  home  block  can  then 
be  reused  to  hold  a  different  shared  block  or  a  replica 
for  the  processor  local  to  the  home  tile,  while  the  mi¬ 
gration  tag  points  to  the  remote  replica  data.  We  show 
how  the  total  area  overhead  of  this  scheme  is  less  than 
for  a  private  cache  scheme,  while  providing  the  best 
performance  over  all  workloads  (single-threaded,  multi¬ 
threaded,  and  multi-programmed)  as  compared  to  pri¬ 
vate,  shared,  or  victim  replication  schemes. 

2.  Baseline  Tiled  CMP  Designs 

In  this  section,  we  describe  three  baseline  L2  cache 
configurations:  private,  shared,  and  victim  replication. 
The  tiled  CMP  model  used  is  shown  in  Figure  1.  Ad¬ 
ditional  assumptions  about  the  model  are  as  follows:  1) 
The  primary  instruction  and  data  caches  are  kept  small 
and  private  to  give  the  lowest  access  latency,  2)  The  lo¬ 
cal  L2  cache  slice  is  tightly  coupled  to  the  rest  of  the  tile 
and  is  accessed  with  a  fixed  latency  pipeline.  The  tag, 
status,  and  directory  bits  are  kept  separate  from  the  data 
arrays  and  close  to  the  processor  and  router  for  quick  tag 
resolution,  3)  Accesses  to  L2  cache  slices  on  other  tiles 
travel  over  the  on-chip  network  and  experience  varying 
access  latencies  depending  on  inter-tile  distance  and  net¬ 
work  congestion,  4)  A  generic  four-state  MESI  protocol 
with  reply-forwarding  is  used  as  the  baseline  protocol 
for  on-chip  data  coherence.  Each  CMP  design  uses  mi¬ 
nor  variants  of  this  protocol. 

2.1.  Private  Design 

In  the  private  design  (Figure  2),  the  processor  uses 
the  local  L2  slice  as  a  private  L2  cache,  as  is  the  case  for 
several  commercial  CMPs  [6,  16].  This  is  equivalent  to 
simply  shrinking  a  traditional  multi-chip  multiprocessor 
onto  a  single  chip.  Snoopy  protocols  scale  poorly  to  fu¬ 
ture  technologies  and  large  numbers  of  nodes,  and  so  we 
employ  a  high-bandwidth  on-chip  directory  to  maintain 
coherence  among  the  private  L2  caches. 

The  directory  is  held  as  a  duplicate  set  of  L2  tags  dis¬ 
tributed  across  tiles  by  address  [4],  For  each  processor 
accessing  a  particular  cache  block,  a  copy  of  the  block 
must  be  resident  in  its  private  L2  cache,  such  as  block 
i  shown  in  Figure  2.  In  addition,  an  on-chip  directory 
holding  an  entry  for  block  i  is  stored  at  blk  i’s  home  tile. 
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Figure  1 .  Dancehall  versus  tiled  chip  multiprocessor  designs. 
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Figure  2.  The  private  L2  design  treats  each  L2  slice  as  a  private  cache. 


Figure  3  shows  a  simplified  two-tile  example  of  how  this 
scheme  works.  Each  tile  has  a  direct-mapped  L2  cache 
with  four  cache  blocks.  To  determine  address  A’s  status 
on-chip,  we  use  the  lower  bits  of  the  index,  the  home 
select  bits,  to  find  A’s  home  tile.  The  remaining  bits 
of  the  index  are  used  to  find  the  duplicate  tag  entry  cor¬ 
responding  to  A  in  the  directory  on  the  home  tile.  This 
entry  stores  the  duplicates  of  all  L2  tags  in  the  cache  set 
that  A  maps  to  from  all  the  tiles.  In  this  example,  A  maps 
to  set  3,  and  the  duplicated  tag  entry  has  the  L2  tags  of 
set  3  from  both  tile  0  and  tile  1.  With  these  tags,  we  can 
easily  deduce  the  directory  information  of  A.  Therefore, 
we  have  a  perfect  directory  for  all  of  the  cache  blocks 
on  chip.  The  main  drawback  of  this  approach  is  the  area 
overhead,  which  we  discuss  later. 

Cache-to-cache  transfers  are  used  to  reduce  off-chip 
requests  for  local  L2  misses,  but  these  operations  require 
three-way  communication  between  the  requester  tile,  the 
directory  tile,  and  the  owner  tile.  This  operation  is  more 
costly  than  hits  to  global  locations  in  a  shared  design, 
where  a  three-way  cache-to-cache  transfer  only  occurs 
if  the  block  is  held  exclusive. 

2.2.  Shared  Scheme 

In  the  shared  design,  all  of  the  L2  slices  are  man¬ 
aged  as  a  single  shared  L2  cache  with  addresses  inter¬ 
leaved  across  tiles.  The  shared  design  is  used  by  a  num¬ 
ber  of  existing  CMP  designs  [4,  18,  25,  21],  where  sev¬ 
eral  processor  cores  share  a  banked  L2  cache.  Figure  4 
shows  the  implementation  in  detail.  Each  address  is  stat¬ 
ically  mapped  to  a  home  tile  (identical  to  the  private  di¬ 
rectory  mapping  scheme).  On  an  LI  cache  miss,  the 
fetch  request  is  processed  by  the  L2  slice  at  the  requested 
block’s  home  tile.  Latency  to  the  L2  slice  varies  accord¬ 
ing  to  network  congestion  and  the  number  of  network 
hops  between  the  requesting  processor  and  the  home 
tile.  All  the  LI  caches  are  kept  coherent  using  additional 
directory  bits  on  each  L2  block,  which  track  which  tiles 
have  remote  copies.  For  the  8-node  implementation  used 
in  our  evaluation,  this  adds  3  state  bits  and  an  8-bit  shar¬ 
ing  vector  to  each  L2  cache  line.  The  overhead  of  the 
sharing  vector  will  grow  as  the  processor  count  grows, 
but  a  number  of  previously  proposed  techniques  could 
be  used  to  reduce  directory  overhead.  Some  requests 
are  satisfied  using  cache-to-cache  transfers  between  LI 
caches  using  reply-forwarding. 

2.3.  Victim  Replication 

Victim  replication  (VR)  [27]  is  a  simple  hybrid 
scheme  that  tries  to  combine  the  large  capacity  of  the 
shared  design  with  the  low  hit  latency  of  the  private  de¬ 
sign.  Victim  replication  is  based  on  the  shared  design, 
but  in  addition  it  tries  to  capture  LI  evictions  in  the  local 


L2  slice.  Each  retained  victim  is  a  local  L2  replica  of  a 
block  that  already  exists  in  the  L2  cache  of  the  remote 
home  tile. 

When  a  processor  misses  in  the  shared  L2  cache,  a 
block  is  brought  in  from  memory  and  placed  in  the  on- 
chip  L2  of  the  home  tile,  as  in  the  shared  design.  The 
requested  block  is  also  directly  forwarded  to  the  pri¬ 
mary  cache  of  the  requesting  processor.  If  the  block’s 
residency  in  the  primary  cache  is  terminated  because  of 
an  incoming  invalidation  or  writeback  request,  we  sim¬ 
ply  follow  the  usual  protocol  of  the  shared  design.  If  a 
primary  cache  block  is  evicted  because  of  a  conflict  or 
capacity  miss,  VR  attempts  to  keep  a  copy  of  the  vic¬ 
tim  block  in  the  local  slice  to  reduce  subsequent  access 
latency  to  the  same  block. 

While  a  replica  can  be  created  for  for  all  primary 
cache  victims,  VR  never  evicts  a  global  block  with  re¬ 
mote  sharers  in  favor  of  a  local  replica,  as  an  actively 
shared  global  block  is  likely  to  be  in  use.  VR  also  never 
replicates  a  victim  whose  home  tile  happens  to  be  local. 

All  primary  cache  misses  must  now  first  check  the 
local  L2  tags  in  case  there’s  a  valid  replica.  On  a  replica 
miss,  the  request  is  forwarded  to  the  home  tile.  On  a 
replica  hit,  the  replica  is  invalidated  in  the  local  L2  slice 
and  moved  into  the  primary  cache.  When  a  downgrade 
or  invalidation  request  is  received  from  the  home  tile,  the 
L2  tags  must  also  be  checked  in  addition  to  the  primary 
cache  tags. 

3.  Victim  Migration 

Victim  Migration  (VM)  is  a  flexible  hybrid  cache  ar¬ 
chitecture  that  can  dynamically  adapt  to  mimic  the  be¬ 
havior  of  pure  shared  or  pure  private  designs.  Figure  5 
shows  the  cache  hierarchy  arrangement  for  VM.  Each 
L2  consists  of  a  tag  array,  a  data  array,  and  directory 
bits,  similar  to  the  shared  design.  In  addition,  each  L2 
also  has  an  extra  set  of  tags,  we  refer  to  them  as  the  VM 
tag  array.  For  now,  we  assume  that  the  size  and  associa¬ 
tivity  of  the  the  VM  tag  array  is  identical  to  the  regular 
L2  tags. 

Similar  to  previous  designs,  each  cache  block  is  stat¬ 
ically  mapped  to  a  home  tile  in  VM,  which  holds  the 
block  in  one  of  two  forms.  First,  it  can  be  managed  ex¬ 
actly  like  the  shared  design  Second,  if  the  block  is  being 
actively  shared  by  another  tile,  either  as  an  regular  LI 
cache  block  or  an  L2  replica,  the  L2  cache  may  only 
store  the  tag  of  the  block  in  the  VM  tag  array.  By  using 
the  VM  tag  array,  victim  migration  removes  the  unnec¬ 
essary  duplication  of  data  at  the  home  tile,  freeing  up 
space  to  hold  more  replicas  or  other  global  blocks.  The 
only  added  complexity  is  that  both  the  regular  tags  and 
the  VM  tag  array  must  be  searched  during  a  data  fetch. 
If  a  hit  is  found  in  the  VM  tags,  the  request  is  satisfied 


Figure  4.  The  shared  L2  design  treats  all  L2  slices  as  part  of  a  global  shared  cache. 
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Figure  5.  Illustration  of  Victim  Migration. 


through  three-way  cache-to-cache  transfers  using  reply¬ 
forwarding. 

3.1.  Management  Policies 

In  this  section,  we  provide  a  set  of  heuristics  to  effi¬ 
ciently  manage  the  cache  capacity  on-chip.  Specifically, 
we  discuss  three  policies.  First,  the  L2  refill  policy  deter¬ 
mines  where  to  place  a  cache  block  when  the  L2  receives 
a  memory  reply  for  an  L2  miss.  Second,  the  LI  eviction 
policy  determines  whether  to  replicate  (and  if  so,  where 
to  keep)  an  LI  victim  in  the  local  L2  slice.  Third,  if  the 
local  L2  slice  decides  not  to  replicate  an  LI  victim  and 
sends  it  back  to  the  home  tile,  the  remote  tile  writeback 
policy  determines  where  to  place  the  data  if  it  is  held  in 
the  duplicated  tags.  We  assume  that  all  three  policies 
first  look  for  invalid  blocks  in  the  main  data  and  tag  ar¬ 
rays  as  these  spaces  can  be  used  at  no  cost.  The  policies 
describe  below  will  only  be  followed  if  there  are  no  in¬ 
valid  blocks. 


L2  Refill  Policy 

The  L2  refill  policy  is  rather  simple.  We  first  look  for 
an  invalid  VM  tag  to  hold  the  tag  of  the  address  only, 
and  forward  the  data  to  the  requesting  tile.  If  all  the  VM 
tags  are  used,  we  randomly  evict  a  block  in  the  main  data 
array  and  write  back  any  dirty  data  to  memory  if  needed. 


LI  Eviction  Policy 

The  LI  eviction  policy  determines  whether  to  replicate 
an  LI  victim,  and  if  so,  where  to  hold  it  in  the  local  L2 
slice.  We  first  simultaneously  search  for  an  invalid  VM 
tag  and  an  actively  shared  block  in  the  regular  tags.  If 
both  exist,  the  tag  of  the  actively  shared  block  can  be 
moved  to  the  invalid  VM  tag  entry  without  losing  infor¬ 
mation.  The  LI  victim  can  safely  overwrite  the  shared 
block’s  local  data.  As  no  data  is  evicted  from  the  local 
L2  cache,  this  operation  should  not  cause  performance 
degradation.  The  only  minor  effect  may  come  from  the 
possibly  longer  hit  latency  required  to  perform  a  three- 
way  cache-to-cache  transfer  when  a  remote  request  hits 
in  the  VM  tags  when  the  line  was  previously  stored  in 
the  regular  tags. 

If  the  above  scenario  is  not  possible  and  we  must 
evict  a  valid  block,  we  look  to  replace  the  following  two 
classes  of  block  in  descending  priority  order:  1)  a  global 
block  with  no  LI  sharers;  2)  an  existing  replica  block. 
If  neither  of  the  two  classes  of  block  exist,  we  do  not 
replicate  the  LI  victim.  This  approach  is  similar  to  the 
one  used  in  [27] . 

Remote  Tile  Writeback  Policy 

This  policy  is  used  whenever  a  tile  has  to  evict  a  line 
back  to  the  home  node,  either  from  its  primary  cache 
when  no  replica  can  be  created,  or  from  the  victim  repli¬ 
cas  when  they  are  themselves  evicted.  If  the  line  is  al- 


ready  held  in  the  regular  tag  and  data  array,  we  perform 
a  conventional  update.  If  the  tag  is  held  in  VM  tag  array 
and  another  tile  still  has  a  copy  of  the  data,  we  simply 
update  the  directory  information  in  the  VM  tag.  How¬ 
ever,  if  the  last  on-chip  copy  of  a  cache  block  is  sent 
home  and  its  tag  is  kept  in  the  VM  tag  array,  we  must 
decide  if  and  where  to  keep  this  unique  copy. 

We  first  look  for  an  actively  shared  global  block, 
which  currently  does  not  need  the  data  array  space  to 
store  its  data.  This  global  block  can  be  swapped  with 
the  remote  writeback.  If  we  can  find  such  a  swap,  no 
data  is  evicted  from  the  chip. 

If  this  scenario  is  not  possible,  we  use  the  approach 
outlined  in  the  LI  eviction  policy  to  look  for  unowned 
blocks,  then  replica  blocks  to  replace.  If  a  replica  is  re¬ 
placed,  there  can  be  a  ripple  effect  as  the  evicted  replica 
is  written  back  to  its  own  home  node. 

If  no  unowned  blocks  or  replicas  are  found,  we  again 
choose  not  to  evict  actively  shared  blocks  as  they  are 
likely  to  be  in  the  active  working  set.  In  this  case,  the  re¬ 
mote  tile  writeback  is  evicted  from  the  chip  and  written 
back  to  memory  if  necessary. 

4.  Experimental  Methodology 

In  this  section,  we  describe  the  simulation  frame¬ 
work  we  used  to  evaluate  the  alternative  cache  designs. 
To  present  a  clearer  picture  of  memory  system  behav¬ 
ior,  we  use  a  simple  in-order  processor  model  and  fo¬ 
cus  on  the  average  raw  memory  latency  seen  by  each 
memory  request.  Naturally,  overall  system  performance 
can  only  be  determined  by  co-simulation  with  a  detailed 
processor  model,  though  we  expect  the  overall  perfor¬ 
mance  trend  to  follow  average  memory  access  latency. 
Prefetching,  decoupling,  non-blocking  caches,  and  out- 
of-order  execution  are  well-known  microarchitectural 
techniques  which  overlap  memory  latencies  to  reduce 
their  impact  on  performance.  However,  machines  us¬ 
ing  these  techniques  complete  instructions  faster,  and 
are  therefore  relatively  more  sensitive  to  any  latencies 
that  cannot  be  hidden.  Also,  these  techniques  are  com¬ 
plementary  to  the  victim  migration  scheme,  and  cannot 
provide  the  same  benefit  of  reducing  cross-chip  traffic. 

4.1.  Simulator  Setup  and  Parameters 

We  have  implemented  a  full-system  execution-driven 
simulator  based  on  the  Bochs  [19]  system  emulator. 
We  added  a  cycle-accurate  cache  and  memory  simula¬ 
tor  with  detailed  models  of  the  primary  caches,  the  L2 
caches,  the  2D  mesh  network,  and  the  DRAM.  Both 
instruction  and  data  memory  reference  streams  are  ex¬ 
tracted  from  Bochs  and  fed  into  the  detailed  memory 
simulator  at  run  time.  The  combined  limitations  of 


Bochs  and  our  Linux  port  restricts  our  simulations  to 
8  processors.  Results  are  obtained  by  running  Linux 
2.4.24  compiled  for  an  x86  processor  on  an  8-way  tiled 
CMP  arranged  in  a  4x2  grid. 

Four  cache  configurations  are  simulated  for  each  dif¬ 
ferent  design,  as  summarized  in  Table  1.  To  simplify 
result  reporting,  all  latencies  are  scaled  to  the  access 
time  of  the  primary  cache,  which  can  be  reached  within 
a  single  clock  cycle.  We  assume  a  70  nm  technology 
based  on  BPTM  [10],  thus  use  a  a  16F04  clock  cy¬ 
cle  for  configuration  1,  which  has  a  smaller  16KB  LI 
cache,  and  a  24F04  clock  cycle  for  configurations  2 
through  4.  Both  of  these  cycle  times  represent  modern 
power-performance  balanced  pipeline  designs  [11,  24], 
High-frequency  designs  might  target  a  cycle  time  of  8- 
12  F04  delays  [13,  23],  in  which  case  cycle  latencies 
can  be  multiplied  appropriately.  Five  cycle  access  la¬ 
tency  is  used  for  a  256KB  L2  cache  and  six  cycles  for 
512KB  and  1MB  caches.  We  also  scale  all  latencies  ap¬ 
propriately  for  configuration  4,  in  which  the  LI  cache  is 
smaller.  We  model  each  hop  in  the  network  as  taking  3 
cycles,  including  the  router  latency  and  an  optimally- 
buffered  inter-tile  copper  wire  on  a  high  metal  layer. 
Note  that  the  worst  case  contention-free  L2  hit  latency 
is  between  29  to  32  cycles  for  these  configurations,  hint¬ 
ing  that  even  a  small  reduction  in  cross-chip  accesses 
could  lead  to  significant  performance  gains.  The  L2 
set-associativity  (16-way)  was  chosen  to  be  larger  than 
the  number  of  tiles  to  reduce  cache  conflicts  between 
threads.  For  L2  associativities  of  8  or  less,  we  found 
several  workloads  had  significant  inter-thread  conflicts, 
reflected  by  high  off-chip  miss  rates. 

4.2.  Workloads 

Table  2  summarizes  the  single-threaded,  multi¬ 
threaded,  and  multi-programmed  workloads  used  to 
evaluate  the  designs.  All  12  SpecINT2000  benchmarks 
are  used  as  single-threaded  workloads.  They  are  com¬ 
piled  with  the  Intel  C  compiler  (version  8.0.055)  us¬ 
ing  -03  -static  -ipo  -mpl  +FDO  and  use  the 
MinneSPEC  large-reduced  dataset  as  input.  The  multi- 
programmed  workloads  are  also  created  using  a  mixture 
of  the  single-threaded  SpecINT2000  benchmarks.  Each 
workload  consists  of  8  different  programs,  chosen  at  ran¬ 
dom. 

All  12  SpecINT2000  benchmarks  are  used  as  single- 
threaded  workloads.  They  are  compiled  with  the  In¬ 
tel  C  compiler  (version  8.0.055)  using  -03  -static 
-ipo  -mpl  +FDO  and  use  the  MinneSPEC  large- 
reduced  dataset  as  input.  The  multiprogrammed  work¬ 
loads  are  also  created  using  a  mixture  of  the  single- 
threaded  SpecINT2000  benchmarks.  Each  workload 
consists  of  8  different  programs,  chosen  at  random. 


All  workloads  were  invoked  in  a  runlevel  without 
superfluous  processes/daemons  to  prevent  non-essential 
processes  from  interfering  with  the  workload.  Each  sim¬ 
ulation  begins  with  the  Linux  boot  sequence,  but  results 
are  only  gathered  after  the  workload  begins  execution 
until  completion. 

Due  to  the  long  running  nature  of  the  workloads,  we 
used  a  sampling  technique  to  reduce  simulation  time. 
We  extend  the  functional  warming  method  for  super¬ 
scalars  [26]  to  an  SMP  system,  and  fast-forward  through 
periods  of  execution  while  maintaining  cache  and  direc¬ 
tory  state  [3],  At  the  start  of  each  measurement  sample, 
we  run  the  detailed  timing  model  for  10,000  cycles  to 
warm  up  the  pipelines  for  the  cache,  DRAM,  and  net¬ 
work  models.  After  this  warming  phase,  we  gather  de¬ 
tailed  statistics  for  one  million  instructions,  before  re¬ 
entering  fast-forward  mode.  Detailed  samples  are  taken 
at  random  intervals  during  execution  and  include  33% 
of  all  instructions  executed,  i.e.,  fast-forward  intervals 
average  around  five  million  instructions.  The  number 
of  samples  taken  for  each  workload  ranges  from  around 
150  to  1,000.  To  minimize  the  bias  in  the  results  intro¬ 
duced  by  system  variability  [2],  we  ran  multiple  runs  of 
each  workload  with  varying  sample  lengths  and  frequen¬ 
cies.  Results  show  that  the  variability  is  insignificant  for 
our  workloads. 

5.  Simulation  Results 

In  this  section,  we  present  the  results  of  all  four 
designs  for  multi-threaded,  single-threaded,  and  multi- 
programmed  workloads. 

5.1.  Multi-Threaded  Workloads 

Figures  6  to  9  show  the  key  result,  the  average  mem¬ 
ory  access  latency  seen  by  a  processor.  The  mini¬ 
mum  latency  is  one  cycle,  when  all  accesses  hit  in  the 
LI  cache.  In  the  following,  we  take  Configuration  1 
(8+8/256/16F04),  and  give  a  simple  analysis  of  how 
each  cache  design  alternative  worked.  Figure  6  shows 
the  access  latency,  and  Figure  10  shows  the  memory  ac¬ 
cesses  breakdown.  In  Figure  10,  a  cache  access  is  cate¬ 
gorized  as  either  an  LI  hit,  an  L2  hit,  or  a  miss  that  must 
access  off-chip  memory.  An  L2  hit  can  be  either  a  fast 
local  L2  hit  or  a  slower  non-local  hit  to  an  L2  slice  on  a 
different  tile.  Hits  through  cache-to-cache  transfers  are 
considered  non-local  and  hits  in  the  replicated  victims 
are  considered  local. 

Analysis 

From  Figure  6,  we  observe  that  for  the  majority  of  the 
workloads,  the  difference  in  performance  between  pri¬ 
vate  and  shared  designs  is  significant.  One  workload, 


IS,  has  a  working  set  that  fits  in  the  LI  cache,  thus  L2 
policies  do  not  matter  as  most  accesses  are  LI  hits.  Av¬ 
erage  access  latency  in  this  case  is  roughly  one  cycle  for 
all  four  designs. 

Compared  to  the  shared  design  (second  bar  in  Fig¬ 
ure  10),  the  private  design  (first  bar  in  Figure  10)  has 
higher  off-chip  miss  rates  but  also  many  more  local  hits 
across  the  workloads.  We  expect  the  private  design  to 
win  if  the  difference  in  the  off-chip  miss  rate  is  small 
compare  to  the  extra  number  of  local  hits  it  has  com¬ 
pared  with  the  shared  design.  This  is  the  case  for  work¬ 
loads  BT,  CG,  EP,  FT,  LU,  and  apache.  On  the  other 
hand,  if  the  difference  in  off-chip  miss  rate  is  signifi¬ 
cant,  we  expect  the  shared  design  to  win  even  though 
it  has  many  fewer  local  hits,  as  it  minimizes  expensive 
off-chip  misses.  This  is  the  case  for  workloads  MG,  SP, 
dbench,  and  checkers. 

Both  victim  replication  (third  bar  in  Figure  10)  and 
victim  migration  (fourth  bar  in  Figure  10)  create  replicas 
for  reduced  hit  latency  (shown  by  the  increase  in  local 
L2  hits  from  the  shared  design)  at  the  expense  of  slightly 
increased  off-chip  miss  rate. 

Victim  migration  works  better  than  victim  replication 
for  all  of  these  workloads.  VM  stores  the  tags  of  actively 
shared  cache  blocks  in  the  VM  tag  array,  thus  vacates 
some  of  the  actual  data  storage  space.  This  space  is  split 
between  replicas  and  unshared  global  blocks.  Having 
more  replicas  is  likely  to  increase  the  number  of  local 
L2  hits,  and  having  more  global  blocks  is  likely  to  re¬ 
duce  the  off-chip  miss  rate.  Both  of  the  scenarios  can 
be  observed  by  comparing  the  third  and  fourth  bars  in 
Figure  10. 

Other  Configurations 

As  we  increase  the  on-chip  capacity,  whether  private  or 
shared  works  better  changes  even  for  the  same  work¬ 
load  depending  on  whether  its  working  set  fits  into  the 
given  cache  size.  For  EP,  the  shared  design  does  bet¬ 
ter  in  larger  caches  as  more  of  its  working  set  fits  on- 
chip.  For  SP,  the  private  design  starts  to  win  with  larger 
caches  when  more  of  its  working  set  fits  into  the  1MB 
local  L2  slice.  VM  manages  to  be  the  best  policy  for 
most  of  these  workloads.  A  performance  summary  is 
shown  in  Table  3. 

5.2.  Single-Threaded  Workloads 

Figures  1 1  to  14  show  the  average  access  latency  for 
the  single-threaded  workloads.  Because  they  have  work¬ 
ing  sets  that  are  generally  smaller  than  even  the  smallest 
configuration  simulated  (2MB),  VM  did  not  provide  no¬ 
ticeable  improvement  over  VR.  However,  VM  is  either 


Latency  (Cycles)  Latency  (Cycles) 


Average  Data  Access  Latency  Average  Data  Access  Latency 


Figure  6.  Access  latencies  of  multi-threaded  work¬ 
loads.  Configuration  1:  8+8/256/16F04. 


Figure  7.  Access  latencies  of  multi-threaded  work¬ 
loads.  Configuration  2:  16+16/256K/24F04. 
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Figure  8.  Access  latencies  of  multi-threaded  work¬ 
loads.  Configuration  3:  16+1 6/5 12K/24F04. 


Figure  9.  Access  latencies  of  multi-threaded  work¬ 
loads.  Configuration  4:  16+16/1024/24FO4. 


the  best  policy  or  a  very  close  second  across  all  bench¬ 
marks. 

5.3.  Multi-Programmed  Workloads 

Multiprogrammed  workloads  tend  to  have  very  little 
sharing  among  the  different  threads,  thus  the  private  de¬ 
sign  is  likely  to  do  significantly  better  than  the  shared 
design.  Figures  15  to  18  confirm  this  intuition,  where 
the  shared  design  is  always  the  worst  by  a  large  margin. 

Victim  replication’s  performance  is  close  to  the  pri¬ 
vate  design,  usually  within  5%.  Because  there  is  very 
little  sharing,  each  home  block  is  generally  used  by  only 
one  tile,  meaning  that  the  cache  block  is  stored  twice 
on  the  chip.  This  duplication  significantly  reduces  the 
effective  capacity,  making  victim  replication  unlikely  to 
win  over  the  private  design.  This  effect  is  better  demon¬ 
strated  in  the  smaller  cache  sizes,  where  capacity  is  at  a 
higher  premium. 

Compared  to  victim  replication,  victim  migration 
eliminates  the  need  to  keep  a  duplicate  copy  at  the  home 
tile,  behaving  just  like  the  private  design  when  neces¬ 
sary.  In  addition,  VM  allows  data  to  be  stored  at  a  global 
location,  stealing  limited  capacity  from  other  threads 
when  their  working  sets  do  not  saturate  their  local  L2 
slice.  While  more  flexible  capacity  stealing  techniques 
have  been  proposed  in  [8,  22],  they  are  based  on  snoop¬ 
ing  coherence  protocols,  in  which  locating  a  cache  block 
on-chip  is  relatively  easy.  In  a  directory-based  design, 
however,  this  global  search  is  likely  to  be  very  complex 
and  expensive.  Overall,  victim  migration  is  the  best  pol¬ 
icy  for  almost  all  workloads. 

5.4.  Reducing  VM  Tag  Array  Area  Overhead 

The  main  drawback  of  victim  migration  is  the  area 
overhead  caused  by  the  VM  tag  array.  For  simplicity, 
we  have  so  far  assumed  that  the  VM  tag  array  size  is 
identical  to  the  regular  L2  tags.  However,  the  VM  tag 
array  can  be  of  any  size  and  associativity.  We  selected 
configuration  3  ( 16+1 6/5 12/24F04)  and  simulated  the 
performance  of  two  additional  VM  tag  array  sizes:  at 
50%,  and  25%  of  the  regular  tag  array  size.  The  50% 
case  caused  no  performance  degradation.  The  25%  case 
lost  about  15%  of  the  latency  reduction  achieved  by  the 
full  VM  tag  array  over  victim  replication.  We  also  ex¬ 
perimented  with  higher  VM  tag  array  associativities  and 
observed  no  noticeable  gains. 

5.5.  Area  Comparison  of  Designs 

The  vast  majority  of  the  area  in  caches  are  occupied 
by  data  arrays,  peripheral  circuitry,  and  interconnects, 
which  is  the  same  for  all  four  designs  described  in  this 
paper.  However,  the  tag  bits,  status  bits,  and  in  our  case. 


directory  bits,  all  take  up  non-negligible  space.  In  this 
section,  we  provide  a  simple  quantified  comparison  of 
the  area  occupies  by  the  tags  and  directories  for  each  of 
these  designs.  We  use  the  parameters  in  configuration  1 
(8+8/256/16F04)  in  the  comparison.  We  also  assume  a 
40-bit  physical  address  width  and  64  byte  cache  block 
size,  which  are  modest  representatives  of  modern  CMP 
machines  [25]. 

Table  4  shows  the  tag  and  directory  area  estimates  in 
bits/block  used  for  each  design.  It  also  shows  the  total 
bits  overhead  compared  to  the  shared  design,  which  re¬ 
quires  the  least  area.  The  actual  overall  cache  area  over¬ 
head  is  likely  to  be  much  smaller  than  the  ones  in  Table  4 
when  the  area  of  peripheral  circuitry  and  interconnects 
are  taken  into  consideration. 

In  the  shared  design,  the  address  is  used  to  index  a 
single  large  shared  cache,  the  width  of  the  tag  is  smaller 
than  that  of  the  private  design.  In  the  8-tile  configura¬ 
tion,  three  bits  are  used  to  select  a  home  tile,  making 
the  shared  tag  3  bits  shorter  than  that  of  the  private  de¬ 
sign.  The  directory  uses  an  8-bit  wide  sharing  vector.  It 
also  leverages  the  existing  valid  and  dirty  status  bits  to 
represent  state,  adding  only  one  extra  state  bit  in  our  de¬ 
sign,  for  a  total  of  a  9-bit  directory.  The  private  design 
uses  the  largest  area,  by  having  a  wider  tag  and  a  fully 
duplicated  tag  array  to  maintain  the  on-chip  directory. 

For  victim  replication,  the  L2  tag  must  be  wide 
enough  to  hold  physical  addresses  from  any  home  tile, 
thus  the  tag  width  becomes  the  same  as  the  private  de¬ 
sign.  Global  L2  blocks  redundantly  set  these  bits  to 
the  address  index  of  the  home  tile.  Replicas  of  remote 
blocks  can  be  distinguished  from  regular  L2  blocks  as 
their  additional  tag  bits  do  not  match  the  local  tile  index. 
The  full  version  of  victim  migration  incurs  the  largest 
area  overhead  of  all  four  designs.  It  consists  of  all  of 
the  components  used  in  victim  replication,  as  well  as  the 
VM  tag  array.  However,  the  overhead  can  be  reduced  to 
less  than  that  of  the  private  design  by  halving  the  size  of 
the  VM  tag  array  with  no  performance  degradation. 

6.  Related  Work 

Data  migration  techniques  [17,  7]  discussed  in  the  in¬ 
troduction  could  have  poor  performance  when  applied 
to  tiled  CMPs.  A  given  L2  block  may  be  repeatedly  ac¬ 
cesses  by  cores  at  opposite  corners  of  the  die.  A  recent 
study  [5]  investigates  the  behavior  of  block  migration 
in  CMPs  using  a  variant  of  D-NUCA,  but  the  proposed 
protocol  is  complex  and  relies  on  a  “smart  search”  al¬ 
gorithm  for  which  no  practical  implementation  is  given. 
The  benefits  are  also  limited  by  the  tendency  for  shared 
data  to  migrate  to  the  center  of  the  die. 

Several  proposals  advocate  data  replication  [8,  27, 
22],  which  allows  sharers  to  replicate  local  copies  of 


shared  data  for  fast  access.  Victim  replication  has 
been  already  discussed  in  detail  in  this  paper.  CMP- 
NuRAPID  [8]  extends  NuRAPID  to  support  data  repli¬ 
cation  for  CMPs  based  on  a  snooping  coherence  pro¬ 
tocol.  However,  the  actual  implementation  is  complex 
and  incurs  a  large  area  overhead.  In  the  baseline  IBM 
Power4  scheme  [25],  each  node  has  a  non-inclusive  L3 
cache  that  stores  the  local  L2  victims.  However,  while 
L3s  can  be  snooped  by  other  nodes,  the  local  L2  vic¬ 
tim  always  overwrite  the  local  L3,  causing  considerable 
pressure  on  the  L3  cache  and  reduces  the  effective  L3 
capacity.  In  [22],  this  baseline  scheme  is  improved  by 
using  a  small  history  table  to  selectively  remove  some 
clean  writebacks  with  data  present  in  the  L3. 

Data  replication  also  bears  resemblance  to  earlier 
work  on  remote  data  caching  in  conventional  CC- 
NUMA  and  COMA  architectures  [20,  9,  28],  which  also 
try  to  retain  local  copies  of  data  that  would  otherwise  re¬ 
quire  a  remote  access.  There  are  two  major  differences 
in  the  CMP  structure,  however,  that  limit  the  applicabil¬ 
ity  of  prior  remote  caching  work.  First,  in  CC-NUMAs, 
all  the  local  cache  on  a  node  is  private  so  the  alloca¬ 
tion  between  local  and  remote  data  only  affects  the  lo¬ 
cal  node.  In  a  CMP,  on-chip  L2  capacity  is  shared  by 
all  nodes,  and  so  a  local  node’s  replacement  policy  af¬ 
fects  cache  performance  of  all  nodes.  Second,  in  both 
CC-NUMA  and  COMA  systems,  remote  data  is  further 
away  than  local  DRAM,  thus  it  is  beneficial  to  use  a 
large  remote  cache  held  in  local  DRAM.  In  addition, 
the  cost  of  adding  a  remote  cache  is  low  and  does  not 
diminish  the  performance  of  existing  L2  caches.  In  the 
CMP  structure,  the  remote  caches  are  closer  to  the  local 
node  than  any  DRAM,  and  any  replication  reduces  the 
effective  cache  capacity  for  blocks  that  will  have  to  be 
fetched  from  slow  off-chip  memory. 

Several  related  studies  include  [14],  which  discusses 
the  design  space  for  future  CMPs  with  private  L2  caches. 
In  [15],  a  hardware  hybrid  of  shared  and  private  designs 
is  proposed,  in  which  a  statically  determined  sharing  de¬ 
gree  partitions  the  CMP  into  private  regions,  while  each 
region  itself  maybe  contain  multiple  processor  core  that 
share  all  of  the  cache  resource  within  the  region. 

7.  Conclusion 

As  wire  delay  continue  to  worsen  in  future  micropro¬ 
cessors,  both  off-chip  accesses  and  on-chip  communi¬ 
cations  must  be  carefully  managed  to  efficiently  utilize 
the  on-chip  cache  capacity.  Typical  private  and  shared 
designs  either  have  low  hit  latency  or  low  off-chip  miss 
rate,  but  not  both,  prompting  hybrid  techniques  trying 
to  combine  the  advantages  of  both.  This  paper  intro¬ 
duces  victim  migration,  a  hybrid  technique  that  extends 
the  previously  proposed  victim  replication  scheme.  Vic¬ 


tim  migration  uses  an  extra  migration  tag  array  to  re¬ 
move  the  need  to  duplicate  shared  data  at  the  home  tile, 
freeing  significant  space  for  storing  additional  replicas 
and  global  data.  We  show  that  victim  migration  is  the 
best  overall  policy  across  a  wide  range  of  workloads. 
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Component 

Parameter  [ 

Configuration  1 

Configuration  2 

Configuration  3 

Configuration  4 

8+8/256/1 6F04 

16+1 6/256/24F04 

16+1 6/5 12/24F04 

16+16/1 024/24FO4 

LI  I-Cache  Size/Associativity 

8  KB/16- way 

16  KB/ 16- way 

16  KB/ 16- way 

16  KB/ 16- way 

LI  D-Cache  Size/Associativity 

8  KB/ 16- way 

16  KB/ 16- way 

16  KB/16-way 

16  KB/ 16- way 

LI  Load-to-Use  Latency 

1  cycle 

1  cycle 

1  cycle 

1  cycle 

LI  Replacement  Policy 

Psuedo-LRU 

Psuedo-LRU 

Psuedo-LRU 

Psuedo-LRU 

L2  Cache  Slice  Size/Associativity 

256  KB/ 16- way 

256KB/16-way 

512KB/16-way 

1  MB/ 16- way 

L2  Load-to-Use  Latency  (per  slice) 

8  cycles 

5  cycles 

6  cycles 

6  cycles 

L2  Replacement  Policy 

Random 

Random 

Random 

Random 

External  memory  latency 

192  cycles 

128  cycles 

128  cycles 

128  cycles 

One-hop  latency 

3  cycles 

3  cycles 

3  cycles 

3  cycles 

Worst  case  L2  hit  latency  (contention-free) 

32  cycles 

29  cycles 

30  cycles 

30  cycles 

CMP  Configuration 

4x2  Mesh 

Processor  Model 

in 

-order 

Cache  Line  Size 

64  B 

Table  1 .  Simulation  parameters.  The  numbers  for  each  configuration  represent  the  cache  sizes  and  cycle  times.  For  example, 
8+8/256/16F04  reads  8K  L1I  cache,  8K  LID  cache,  256K  L2  cache,  with  16  F04-delay  cycle  time. 


Benchmark 

(Instruction  Count  in  Billions) 

Description 

|  Multi-Threaded  Benchmarks  | 

BT 

(1.7) 

class  S.  block- tridiagonal  CFD  application 

CG 

(5.0) 

class  W.  conjugate  gradient  kernel 

EP 

class  W.  embarassingly  parallel  kernel 

FT 

class  S.  3X  ID  fast  fourier  transform  (-00) 

IS 

class  W.  integer  sort.  (icc-v8) 

LU 

class  R.  LU  decomposition,  with  SSOR  CFD  application 

MG 

(5.1) 

class  W.  multigrid  kernel 

SP 

(6.7) 

class  R.  scalar  pentagonal  CFD  application 

apache 

(3.3) 

Apache’s  ’ab’  worker  threading  model,  2000  requests,  3  at  a  time  (gcc  2.96) 

dbench 

(3.3) 

executes  Samba-like  syscalls,  3  clients,  10000  requests  (gcc  2.96) 

checkers 

(2.9) 

Cilk  checkers  (parallel  a  —  fl  search).  Black  6,  While  5  (Cilk  5.3.2,  gcc  2.96) 

|  Single-Threaded  Benchmarks  [ 

bzip 

(3.8) 

bzip2  compression  algorithm  version  0.1 

crafty 

(1.2) 

A  high-performance  chess  program  designed  around  a  64-bit  word 

eon 

(2.9) 

A  probabilistic  ray  tracer 

gap 

(1.1) 

A  language  and  library  designed  mostly  for  computing  in  groups 

gcc 

(6.4) 

gcc  compiler  version  2.7. 2. 2  for  Motorola  88100  processor 

Data  compression  program  using  LZ77  for  GNU 

Single-depot  vehicle  scheduling  algorithm  in  public  mass  transportation 

|  parser 

(5.6) 

A  word  processing  parsing  tool 

A  cut-down  version  of  Perl  v5.005_03  without  most  OS-specific  features 

twolf 

(1.5) 

The  TimberWolfSC  place  and  route  tool 

vortex 

(1.5) 

An  single-user  object-oriented  database  transaction  program 

vpr 

(5.3) 

A  place  and  route  tool  for  FPGAs 

|  Multi-Programmed  Benchmarks  | 

mixO 

(23.9) 

bzip,  crafty,  eon,  gap,  gcc,  gzip,  mcf,  parser 

mixl 

(24.8) 

gcc,  gzip,  mcf,  parser,  perlbmk,  twolf,  vortex,  vpr 

mix2 

(19.1) 

bzip,  crafty,  eon,  gap,  perlbmk,  twolf,  vortex,  vpr 

mix3 

(22.8) 

bzip,  gap,  mcf,  twolf,  crafty,  gcc,  parser,  vortex 

mix4 

(19.1) 

bzip,  gap,  mcf,  twolf,  eon,  gzip,  perlbmk,  vpr 

crafty,  gcc,  parser,  vortex,  eon,  gzip,  perlbmk,  vpr 

|  mix6 

(12.7) 

crafty,  eon,  gap,  gzip,  mcf,  perlbmk,  twolf,  vortex 

bzip,  gap,  gzip,  mcf,  parser,  twolf,  vortex,  vpr 

|  mix8 

(28.0) 

bzip,  crafty,  eon,  gap,  gcc,  mcf,  parser,  vpr 

Table  2.  Benchmarks  Descriptions.  Three  different  categories  of  benchmarks  are  used:  single-threaded 
benchmarks,  multi-threaded  benchmarks,  and  multi-programmed  benchmarks. 


Breakdown  (%) 


Figure  10.  Memory  access  breakdown  of  multi-threaded  programs.  Moving  from  left  to  right,  the  four  bars  for  each 
workload  are  for  the  private  design,  shared  design,  victim  replication,  and  victim  migration,  respectively.  Hits  from  cache- 
to-cache  transfers  are  considered  non-local  and  hits  from  replicas  are  considered  local. 
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Figure  1 1 .  Access  latencies  of  single-threaded  work¬ 
loads.  Configuration  1:  8+8/256/ 16F04. 


Figure  1 2.  Access  latencies  of  single-threaded  work¬ 
loads.  Configuration  2:  16+16/256K/24F04. 
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Figure  1 3.  Access  latencies  of  single-threaded  work¬ 
loads.  Configuration  3:  16+16/512/24F04. 


Figure  1 4.  Access  latencies  of  single-threaded  work¬ 
loads.  Configuration  4:  16+16/1024/24FO4. 
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Figure  15.  Access  latencies  of  multi-programmed  work¬ 
loads.  Configuration  1:  8+8/256/1 6F04. 
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Figure  17.  Access  latencies  of  multi-programmed  work¬ 
loads.  Configuration  3:  16+1 6/5 12/24F04. 
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Figure  16.  Access  latencies  of  multi-programmed  work¬ 
loads.  Configuration  2:  16+16/256/24F04. 
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Figure  18.  Access  latencies  of  multi-programmed  work¬ 
loads.  Configuration  4:  16+16/1024/24FO4. 


Configuration  1 
8+8/256/1 6F04 

Configuration  2 

1 6+ 1 6/256/24F04 

Configuration  3 

16+1 6/5 12/24F04 

Configuration  4 

16+16/1 024/24FO4 

Reductin  (%) 

Reductin  (%) 

Reductin  (%) 

Reductin  (%) 

Workload 

VM/S 

VM/P  VM/VR  | 

VM/S 

VM/P  VM/VR  | 

VM/S 

VM/P  VM/VR  | 

VM/S 

VM/P  VM/VR  | 

j  Multi-Threaded  Workloads  | 

BT 

45.3 

-1.9 

-0.6 

11.5 

-0.6 

0.2 

18.9 

-0.6 

0.4 

9.4 

0.0 

0.0 

CG 

30.3 

-2.7 

2.0 

30.0 

-5.6 

1.9 

37.4 

-10.3 

4.8 

77.3 

-22.1 

14.2 

EP 

4.1 

0.3 

0.3 

-0.4 

2.7 

-0.5 

1.8 

7.5 

1.1 

14.8 

34.5 

3.9 

FT 

26.4 

5.6 

7.1 

17.5 

3.5 

2.5 

25.2 

1.6 

3.6 

25.8 

12.3 

3.8 

IS 

0.8 

0.6 

0.1 

-0.0 

0.1 

0.0 

0.3 

1.3 

0.1 

0.2 

0.9 

0.5 

LU 

27.3 

9.0 

12.3 

31.6 

3.5 

14.4 

40.5 

12.2 

9.4 

50.1 

13.7 

3.7 

MG 

6.1 

9.3 

2.8 

6.0 

12.9 

3.4 

4.1 

9.2 

1.2 

9.6 

14.5 

4.3 

SP 

7.7 

22.1 

4.0 

14.6 

23.1 

5.0 

7.7 

15.5 

3.3 

16.5 

0.2 

2.0 

apache 

14.0 

-0.2 

3.6 

7.3 

10.9 

0.9 

8.8 

-0.4 

-1.1 

15.4 

0.7 

-1.6 

dbench 

4.5 

9.9 

2.2 

8.2 

6.4 

6.0 

9.6 

17.0 

0.3 

22.3 

10.7 

3.1 

checkers 

8.7 

32.5 

4.7 

8.1 

31.9 

4.0 

5.3 

36.2 

2.1 

4.9 

32.0 

1.7 

Avg 

15.9 

7.6 

3.5 

12.2 

8.1 

3.4 

14.5 

8.1 

2.3 

22.4 

8.9 

3.2 

j  Single-Threaded  Workloads  j 

bzip 

22.3 

42.3 

11.2 

17.5 

32.0 

0.8 

19.4 

26.0 

2.4 

27.6 

22.1 

4.9 

crafty 

105.0 

9.4 

0.2 

43.6 

10.4 

0.7 

42.3 

5.4 

0.5 

43.4 

1.1 

-1.7 

eon 

47.2 

0.1 

-2.1 

5.7 

1.1 

-0.5 

5.1 

2.9 

-0.1 

4.5 

1.8 

-0.5 

gap 

38.0 

16.9 

2.8 

18.9 

11.5 

0.2 

14.5 

13.9 

1.3 

18.0 

14.0 

2.6 

gcc 

42.6 

19.2 

1.1 

31.3 

14.0 

-0.4 

29.8 

10.8 

1.3 

39.5 

8.0 

-0.9 

gzip 

81.6 

41.1 

3.3 

81.6 

21.5 

-0.5 

86.2 

26.0 

0.5 

65.0 

5.8 

-0.4 

mcf 

28.5 

54.9 

7.9 

38.1 

48.6 

1.8 

49.2 

9.6 

9.3 

79.2 

11.7 

0.3 

parser 

41.7 

45.7 

3.9 

31.8 

38.1 

0.9 

34.7 

31.5 

2.2 

50.1 

18.6 

-0.4 

perl 

9.1 

11.8 

2.2 

7.8 

9.1 

1.4 

4.2 

6.8 

0.4 

3.1 

6.9 

-1.6 

twolf 

127.4 

-5.3 

1.6 

123.6 

2.3 

1.9 

124.2 

1.6 

0.5 

101.9 

-0.7 

-2.0 

vortex 

48.2 

14.8 

2.2 

28.8 

13.3 

2.6 

30.7 

6.9 

-1.6 

29.5 

5.1 

-1.8 

vpr 

78.0 

14.2 

0.2 

63.8 

11.5 

-1.1 

75.6 

9.1 

1.3 

69.2 

-0.8 

-1.1 

Avg 

55.8 

22.1 

2.9 

41.0 

17.8 

0.7 

43.0 

12.5 

1.5 

44.3 

7.8 

-0.2 

j  Multi-Programmed  Workloads  [ 

mixO 

44.6 

4.7 

5.3 

23.6 

5.1 

9.1 

30.9 

8.2 

10.6 

34.6 

2.7 

5.5 

mixl 

53.1 

6.3 

7.3 

36.3 

3.5 

12.7 

42.0 

3.6 

8.9 

49.6 

1.1 

4.4 

mix2 

59.7 

7.4 

4.1 

35.1 

4.3 

6.7 

36.0 

9.0 

5.1 

40.3 

2.7 

0.5 

mix3 

54.0 

5.4 

9.4 

31.7 

5.5 

10.6 

34.0 

5.0 

4.9 

42.8 

0.5 

4.3 

mix4 

58.2 

5.9 

10.8 

37.1 

2.9 

11.7 

34.6 

1.3 

2.9 

44.1 

-0.5 

3.1 

mix5 

54.7 

6.4 

5.9 

30.0 

5.7 

4.9 

36.2 

8.3 

3.5 

38.6 

2.6 

2.3 

mix6 

54.5 

0.1 

6.7 

30.8 

0.5 

10.7 

33.6 

-2.6 

8.0 

39.0 

-1.4 

4.1 

mix7 

63.9 

4.4 

10.0 

42.2 

5.3 

12.7 

45.3 

7.3 

12.7 

51.6 

1.5 

4.9 

mix8 

56.2 

7.8 

11.6 

26.4 

4.8 

6.6 

31.6 

2.1 

4.9 

38.9 

1.1 

4.2 

Avg 

55.4 

5.4 

7.9 

32.6 

4.2 

9.5 

36.0 

4.7 

6.8 

42.2 

1.1 

3.7 

Table  3.  Access  latency  reduction  achieved  by  victim  migration.  The  three  numbers  for  each  workload  indicate  the  percent¬ 
age  reduction  with  respect  to  the  shared  design  (VM/S),  the  private  design  (VM/P).  and  victim  replication  (VM/VR). 


Design 

Tag  width 

Dir.  entry  width 

Total  Width 

Bit/Block  Overhead  vs.  Shared 

Shared 

25 

9 

34 

0.0% 

Private 

28 

28 

56 

4.0% 

Victim  Replication 

28 

9 

37 

0.6% 

Victim  Migration  (1/1) 

28 

43 

71 

6.8% 

Victim  Migration  (1/2) 

28 

26 

54 

3.7% 

Victim  Migration  (1/4) 

28 

20 

48 

2.6% 

Table  4.  Cache  area  overhead  of  different  designs. 


